ABSTRACT

This study investigated the effect of writing task 
topic on learner performance in a second-language writing test, in 
this case the Michigan English Language Assessment Battery designed 
to test proficiency in English as a Second Language. The 64 topics or 
"prompts" used in the test (offered as pairs of options) were 
categorized according to writing task type (expository/private; 
expository/public; argumentative/private; argumentative/public; and a 
combination of two or more of the previous types) and the categories 
assigned a level of difficulty. Scores received on the test were then 
correlated with topic difficulty. Contrary to expectation, the mean
writing score increased rather than decreased as topic difficulty 
increased. Implications for test construction and for rater judgment 
are examined. (MSE) 
THE DIFFICULTIES OF DIFFICULTY: PROMPTS
IN WRITING ASSESSMENT 

Liz Hamp-Lyons and Sheila Prochnow



INTRODUCTION

In the field of writing assessment, a growing educational industry not only in 
the United States but also worldwide, it is often claimed that the "prompt", the
question or stimulus to which the student must write a response, is a key
variable. Maintaining consistent and accurate judgments of writing quality, it is
argued, requires prompts which are of parallel difficulty. There are two
problems with this. First, a survey of the writing assessment literature, in both
L1 (Benton and Blohm, 1986; Brossell, 1983; Brossell and Ash, 1984; Crowhurst
and Piché, 1979; Freedman, 1983; Hoetker and Brossell, 1986, 1989; Pollitt and
Hutchinson, 1987; Quellmalz et al, 1982; Ruth and Murphy, 1988; Smith et al,
1985) and L2 (Carlson et al, 1985; Carlson and Bridgeman, 1986; Chiste and
O'Shea, 1988; Cummings, 1989; Hirokawa and Swales, 1986; Park, 1988; Reid,
1989 (in press); Spaan, 1989; Tedick, 1989; Hamp-Lyons, 1990), reveals 
conflicting evidence and opinions on this. Second (and probably causally prior), 
we do not yet have tools which enable us to give good answers to the questions 
of how difficult tasks on writing tests are (Pollitt and Hutchinson, 1985). 
Classical statistical methods have typically been used, but are unable to provide 
sufficiently detailed information about the complex interactions and behaviors 
that underlie writing ability (Hamp-Lyons, 1987). Both g-theory (Bachman, 
1990) and item response theory (Davidson, in press) offer more potential but 
require costly software, statistical expertise, or both, which are typically not
available even in moderate-sized testing agencies, and certainly not to most
school-based writing assessment programs.

An entirely different direction in education research at the moment, 
however, is toward the use of judgments, attitude surveys, experiential data such 
as verbal protocols, and a generally humanistic orientation. Looking in such a 
direction we see that language teachers and essay scorers often feel quite
strongly that they can judge how difficult or easy a specific writing test prompt is,
and are frequently heard to say that certain prompts are problematic because
they are easier or harder than others. This study attempts to treat such
observations and judgments as data, looking at the evidence for teachers' and
raters' claims. If such claims are borne out, judgments could be of considerable
help in establishing prompt difficulty prior to large-scale prompt piloting, and in
reducing the problematic need to discard many prompts because of failure at the
pilot stage.
I. BACKGROUND



The MELAB, a test of English language proficiency similar to the TOEFL
but containing a direct writing component, is developed by the Testing Division
of the University of Michigan's English Language Institute and administered in
the US and in 120 countries and over 400 cities around the world. In addition to
the writing component, the test battery includes a listening component and a
grammar/cloze/vocabulary/reading component (referred to as "Part 3"). There
is also an optional speaking component, consisting of an oral interview. Scores 
on the 3 obligatory components are averaged to obtain a final MELAB score, 
and both component and final scores are reported. Scores are used by college or 
university admissions officers and potential employers in the United States in 
making decisions as to whether a candidate is proficient enough to carry out 
academic work or professional duties in English. 

The writing component of the test is a 30-minute impromptu task, for which 
candidates are offered a choice of two topics. Topics are brief in length, usually
no more than three or four lines, and intended to be generally accessible in
content and prior assumptions to all candidates. Topic development is an ongoing
activity of the Testing Division, and prompts are regularly added to and dropped
from the topic pool. In preparation for each test administration, topic sets are
drawn from the topic pool on a rotating basis, so as to avoid repeated use of any
particular topic set at any test administration site. Currently, 32 topic sets (i.e. 64
separate topics) are being used in MELAB administrations in the US and abroad,
and it is these topic sets, comprising 64 separate prompts, which are examined in
this study.

MELAB compositions are scored by trained raters using a modified holistic
scoring system and a ten-point rating scale (see Appendix 1). Each composition
is read independently by two readers, and by a third when the first two disagree by
more than one scale point. The two closest scores are averaged to obtain a final
writing score. Thus, there are 19 possible MELAB composition scores (the 10
scale points and 9 averaged score points falling in between them). Compositions
from all administration sites are sent to the Testing Division, where they are
scored by trained MELAB raters. Inter-rater reliability for the MELAB 
composition is .90. 
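
The scoring rule just described can be sketched in a few lines of Python. This is an illustration only, not the Testing Division's procedure as implemented: the function name, the numbering of scale positions from 1 to 10, and the tie-breaking choice when a third reading is equally close to both earlier readings are all assumptions made for the example.

```python
from itertools import combinations


def final_writing_score(first, second, third=None):
    """Average the two closest ratings (scale positions 1-10, an assumed numbering)."""
    if abs(first - second) <= 1:
        # The two readers agree within one scale point: no third reading needed.
        return (first + second) / 2
    if third is None:
        raise ValueError("readers differ by more than one point; a third reading is required")
    # Pick the pair of ratings with the smallest gap. Ties go to the earlier
    # pair in (first, second, third) order, which is an arbitrary assumption.
    closest = min(combinations([first, second, third], 2),
                  key=lambda pair: abs(pair[0] - pair[1]))
    return sum(closest) / 2


# Every average of two whole scale points lies on a half-point grid,
# which gives the 19 possible composition scores mentioned in the text.
possible_scores = sorted({(a + b) / 2 for a in range(1, 11) for b in range(1, 11)})
```

Under these assumptions, two readings of 7 and 8 average to a score between two scale points, while readings of 5 and 8 trigger a third reading before any averaging takes place.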



II. METHOD 

Since research to date has not defined what makes writing test topics 
difficult or easy, our first step toward obtaining expert judgments had to be to 
design a scale for rating topic difficulty. Lacking prior models to build on, we
chose a simple scale of 1 to 3, without descriptions for raters to use other than
1 = easy, 2 = average difficulty and 3 = hard. Next the scale and rating procedures
were introduced to 2 trained MELAB composition readers and 2 ESL writing
experts, who each used the scale to assign difficulty ratings to 64 MELAB topics
(32 topic sets). The four raters' difficulty ratings were then summed for each
topic, resulting in one overall difficulty rating per topic, from 4 (complete
agreement on a 1 = easy rating) to 12 (complete agreement on a 3 = hard rating).
We then compared "topic difficulty" (the sum of judgments of the difficulty of 
each topic) to actual writing scores obtained on those topics, using 8,497 cases 
taken from MELAB tests administered in the period 1985-89. 
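
The summed rating described above is simple enough to sketch directly. The function name and the three example topics below are hypothetical and are not data from the study; only the 1-to-3 scale, the four judges, and the resulting 4-to-12 range come from the text.

```python
def summed_difficulty(judge_ratings):
    """Sum four judges' ratings (1 = easy, 2 = average, 3 = hard) for one topic."""
    if len(judge_ratings) != 4:
        raise ValueError("expected ratings from exactly four judges")
    if any(rating not in (1, 2, 3) for rating in judge_ratings):
        raise ValueError("each rating must be 1, 2 or 3")
    return sum(judge_ratings)


# Hypothetical ratings for three made-up topics (not data from the study).
example_ratings = {
    "topic X": [1, 1, 1, 1],   # complete agreement on "easy"
    "topic Y": [2, 3, 2, 1],   # judges disagree
    "topic Z": [3, 3, 3, 3],   # complete agreement on "hard"
}
example_totals = {name: summed_difficulty(r) for name, r in example_ratings.items()}
```

Note that a mid-range total such as 8 can arise either from agreement on "average" or from disagreement among the judges, so the summed rating alone does not show how far the judges agreed.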

Next, we categorized the 64 prompts according to the type of writing task 
each represents. We began with application of the topic type categories
developed by Bridgeman and Carlson (1983) for their study of university faculty
topic preferences. However, judges found that of Bridgeman and Carlson's nine
categories, three were not usable because there were no instances of such topic
types in the dataset; further, only about half of the dataset fit in the remaining six
categories. The remaining half of the topics were generally found to call either
for expository or for argumentative writing. The expository/argumentative
distinction is of course one which has been made in many previous studies
(Rubin and Piché, 1979; Crowhurst and Piché, 1979; Mohan and Lo, 1985;
Quellmalz et al, 1982; etc). Another noticeable difference between topics is that
some call for the writer to take a public orientation toward the subject matter to
be discussed whereas others call for a more private orientation. Similar
distinctions between prompts were noted by Bridgeman and Carlson (1983), who
discuss differences in their various topic types in terms of what they call "degree
of personal involvement", and by Hoetker and Brossell (1989) in their study of
variations in degree of rhetorical specification and of "stance" required of the
writer.

Based on these distinctions, we created a set of 5 task type categories: (1) 
expository/private; (2) expository/public; (3) argumentative/private; (4) 
argumentative/public, and (5) combination (a topic which calls for more than 
one mode of discourse and/or more than one orientation; an example of such a 
topic might be one which calls for both exposition and argumentation, or one 
which calls for both a personal and public stance, or even one which calls for 
both modes and both orientations). Examples of the five types are shown in
Appendix 2. All 64 topics were independently assigned to a category, and then
the few differences in categorization were resolved through discussion.
Following a commonly held assumption often found in the literature (Bridgeman
and Carlson, 1983; Hoetker and Brossell, 1989), we hypothesized that some topic
type categories would be judged generally more difficult than others, and that
expository/private topics would, on average, be judged least difficult, and
argumentative/public topics most difficult. To test this prediction, we used a
two-way analysis of variance, setting topic difficulty as the dependent variable
and topic type as the independent variable.



III. RESULTS and INTERPRETATIONS 



Topic Difficulty

When we displayed the summed topic difficulties based on four judges'
scores for each of the 64 prompts, we obtained the result shown in Table 1.



Table 1

Topic Difficulty for MELAB Prompts

(The Topic Difficulty values for the left-hand pair of columns did not survive
in the source copy and are shown as "--".)

Topic        Topic        Topic        Topic
Difficulty   Set No.      Difficulty   Set No.

   --        11 A             8        42 A
   --        27 A             8        43 B
   --        31 B             8        44 B
   --        33 B             8        45 A
   --        34 B             8        49 B
   --        46 B             9        12 A
   --        49 A             9        18 B
   --        30 B             9        21 B
   --        35 B             9        22 B
   --        41 A             9        23 A
   --        47 A             9        24 B
   --        12 B             9        31 A
   --        22 A             9        33 A
   --        29 B             9        35 A
   --        34 A             9        46 A
   --        37 B             9        50 B
   --        38 A            10        11 B
   --        40 A            10        13 A
   --        40 B            10        24 A
   --        43 A            10        29 A
   --        10 A            10        30 A
   --        21 A            10        39 B
   --        23 B            10        42 B
   --        26 A            10        45 B
   --        28 A            10        47 B
   --        28 B            11        10 B
   --        32 A            11        13 B
   --        32 B            11        18 A
   --        37 A            11        26 B
   --        38 B            11        27 B
   --        39 A            11        44 A
   --        41 B            12        50 A



Most prompts had a difficulty score around the middle of the overall
difficulty scale (i.e. 8). This is either because most prompts are moderately
difficult or, and more likely, because of the low reliability of our judges'
judgments. The reliability of the prompt difficulty judgments, using Cronbach's
alpha, was .55.
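
For readers unfamiliar with the statistic, Cronbach's alpha for a set of judges can be computed as sketched below, treating each judge as an "item" and each topic as a case. The function name and the small ratings matrix are hypothetical; the study's own ratings are not reproduced here, so the sketch does not attempt to recover the reported value of .55.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """ratings: one row per topic, one column per judge (all rows the same length)."""
    k = len(ratings[0])                                    # number of judges
    judge_columns = list(zip(*ratings))                    # transpose: one tuple per judge
    judge_variances = [variance(col) for col in judge_columns]
    total_variance = variance([sum(row) for row in ratings])  # variance of summed ratings
    return (k / (k - 1)) * (1 - sum(judge_variances) / total_variance)


# Hypothetical ratings on the 1-3 scale for six topics by four judges.
hypothetical_ratings = [
    [1, 2, 1, 2],
    [2, 2, 3, 1],
    [3, 2, 3, 3],
    [2, 1, 2, 2],
    [1, 1, 2, 3],
    [3, 3, 2, 2],
]
alpha = cronbach_alpha(hypothetical_ratings)
```

When judges rank topics identically the statistic reaches 1; the more their ratings diverge, the lower it falls, which is why a value of .55 signals limited agreement.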

And here was our first difficulty, and our first piece of interesting data: it
seemed that claims that essay readers and language teachers can judge prompt
difficulty, while not precisely untrue, are also not precisely true, and certainly not
true enough for a well-grounded statistical study. When we looked at the data to
discover whether the judgments of topic difficulty could predict writing score,
using a two-way analysis of variance, in which writing score was the dependent
variable and topic difficulty was the independent variable, we found that our
predictions were almost exactly the reverse of what actually happened (see Table
2).

Table 2: Difficulty Judgments and Writing Scores

ANALYSIS OF VARIANCE OF 8.CATSCOR     N = 8583 OUT OF 8583

SOURCE      DF      SUM OF SQRS   MEAN SQR   F-STATISTIC   SIGNIF

BETWEEN        8       413.31      51.663      5.2529       .0000
WITHIN      8574     84327.         9.8352
TOTAL       8582     84740.
                                    (RANDOM EFFECTS STATISTICS)

ETA = .0698   ETA-SQR = .0049   (VAR COMP = .46927 -1   %VAR AMONG = .47)

SUMDIFF       N       MEAN     VARIANCE   STD DEV

 (4)         679     8.9455     8.4439     2.9058
 (5)         113     8.9823     6.5533     2.5599
 (6)         737     9.1045     9.3872     3.0638
 (7)        1539     9.4048    10.579      3.2526
 (8)        2325     9.4705     9.5634     3.0925
 (9)        1501     9.5776    10.851      3.2941
 (10)       1040     9.6519     9.1242     3.0206
 (11)        577     9.7660    10.763      3.2807
 (12)         72     9.4028     7.1453     2.6731

GRAND       8583     9.4394     9.8742     3.1423



Mean writing score increased, rather than decreased, as topic difficulty 
increased, except for topics in the group judged as most difficult (those whose 
summed rating was 12, meaning all four judges had rated them as 3= difficult). 
As shown in Figure 1, topic difficulty as measured by "expert" judgment explains
almost none of the variance in MELAB writing score.
Figure 1: ANOVA
Topic Difficulty and Writing Score

ANALYSIS OF VARIANCE OF 8.CATSCOR     N = 8583 OUT OF 10447

SOURCE          DF      SUM SQRS    MEAN SQR    F-STAT    SIGNIF

REGRESSION         1      372.05     372.05     37.841     .0000
ERROR           8581    84368.         9.8320
TOTAL           8582    84740.         9.8742

MULT R = .06626   R-SQR = .0044

VARIABLE      PARTIAL    COEFF     STD ERROR     T-STAT    SIGNIF

CONSTANT                 8.5291    .15179        56.190     0.
16.SUMDIFF    .06626     .11456    .18623 -1     6.1515     .0000



Further, while the effect of judged topic difficulty on writing score is significant
(p = .0000), the magnitude of the effect is about 18 times smaller than would be
expected, considering the relative lengths of the writing and topic difficulty
scales. That is, since the writing scale is approximately twice as long as the topic
difficulty scale (19 points vs. 11 points), we would expect, assuming "even" writing
proficiency (i.e. that writing proficiency increases in steps that are all of equal
width) that every 1-point increase in topic difficulty would be associated with a 2-
point decrease in writing score; instead, the coefficient for topic difficulty effect
(.11456) indicates that a 1-point increase in topic difficulty is actually, on
average, associated with only about a 1/10-point increase in writing score. Also,
it should be noted that such an increase is of little practical consequence, since a
change of less than a point in MELAB writing score would have no effect either
on reported level of writing performance or on final MELAB score.
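
The arithmetic behind this comparison can be checked with a short sketch. The input values are those of the regression output in Figure 1 as best they can be read from the printout; the variable names are ours, and the expected slope magnitude of 2 is taken from the text's own reasoning rather than computed independently.

```python
import math

# Values as read from the regression output in Figure 1.
ss_regression = 372.05
ss_total = 84740.0
coefficient = 0.11456    # change in writing score per 1-point rise in summed difficulty
std_error = 0.018623     # printed in the output as ".18623 -1"

# Proportion of writing-score variance associated with judged difficulty.
r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)

# t statistic for the slope is the coefficient divided by its standard error.
t_statistic = coefficient / std_error

# The text expects a slope of about 2 in magnitude, given the two scale lengths.
expected_magnitude = 2.0
shrinkage = expected_magnitude / coefficient
```

Under these figures the observed slope is between 17 and 18 times smaller in magnitude than the expected one, in line with the "about 18 times" stated above, and judged difficulty accounts for well under one percent of the variance in writing score.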



Task Type Difficulty 

We had hypothesized that when topics were categorized according to topic
type, the topic type categories would vary in judged difficulty level, and that the 
overall difficulty level of categories would vary along two continua: "orientation" 
(a private/public continuum), and "response mode" (an 
expository/argumentative continuum) (see Figure 2). 
Figure 2: Response Mode, Orientation and Topic Difficulty
Predictions

[Diagram: two crossed axes, one running from private to public (orientation)
and one running from expository to argumentative (response mode)]



Table 3 shows the difficulty ratings for each category or "response mode":

Table 3: Response Modes and Difficulty Ratings

                      Topic Category Groupings

 ExpPers        ExpPub         ArgPers        ArgPub         Comb.
 #     Diff     #     Diff     #     Diff     #     Diff     #     Diff

 11A     4      40B     7      49A     5      37A     8      30B     6
 27A     4      10A     8      12B     7      39A     8      34A     7
 29B     4      32A     8      38A     7      43B     8      28A     8
 31B     4      41B     8      38B     8      21B     9      45A     8
 33B     4      49B     8      42A     8      22B     9      26B    11
 34B     4      12A     9      35A     9      24B     9
 46B     5      18B     9      24A    10      31A     9
 35B     6      23A     9      29A    10      33A     9
 41A     6      10B    11      39B    10      46A     9
 47A     6      18A    11      45B    10      11B    10
 22A     7                                    13A    10
 37B     7                                    42B    10
 21A     8                                    13B    11
 23B     8                                    27B    11
 26A     8                                    44A    11
 28B     8                                    50A    12
 32B     8

 Entries whose column placement is unclear in the source copy:
 50B (9), 30A (10), 47B (10)

 Mean diff      Mean diff      Mean diff      Mean diff      Mean diff
 = 6.528        = 8.746        = 8.440        = 8.932        = 7.713

 Mean wr.       Mean wr.       Mean wr.       Mean wr.       Mean wr.
 = 9.063        = 9.398        = 9.359        = 9.904        = 9.517

 Overall mean diff = 7.985
We conducted an ANOVA, shown in Figure 3, which showed that our
predictions were correct: prompts categorized as expository/private by judges 
are, on average, judged easiest and those categorized as argumentative/public 
are judged hardest. 



Figure 3: ANOVA
Topic Difficulty Judgments and Response Mode Difficulty
Judgments

ANALYSIS OF VARIANCE OF 16.SUMDIFF     N = 8497 OUT OF 8497

SOURCE      DF      SUM OF SQRS   MEAN SQR   F-STATISTIC   SIGNIF

BETWEEN        4      8635.0       2158.8      998.42       0.
WITHIN      8492     18361.        2.1622
TOTAL       8496     26996.
                                    (RANDOM EFFECTS STATISTICS)

ETA = .5656   ETA-SQR = .3199   (VAR COMP = 1.3219   %VAR AMONG = 37.94)

CATEGORY      N       MEAN     VARIANCE   STD DEV

EXPPRI      2538     6.5284     2.8666     1.6931
EXPPUB      1210     8.7463     1.6618     1.2891
ARGPRI      1543     8.4407     1.7447     1.3209
ARGPUB      2417     8.9326     1.6482     1.2838
COMBIN       789     7.7136     3.0549     1.7478

GRAND       8497     7.9854     3.1775     1.7826

CONTRAST
OBSERVED    PREDICTED    F-STAT    SIGNIF

-2.0986       -0.        892.49     0.
-2.7098       -0.        1488.0     0.
-1.7261       -0.        603.74     0.



Since the two sets of judgments were made by the same judges, albeit six months 
apart, such a finding is to be expected. 



Judgments and Writing Scores

When we looked at the relationships between our "expert" judgments of
topic difficulty and task type, and compared them with writing scores, our
predictions were not upheld by the data. We had hypothesized that topics in the
category judged most difficult (argumentative/public) would get the lowest
scores, while topics in the category judged least difficult (expository/private)
would get the highest scores, with topics in the other categories falling in 
between. To test this hypothesis, we conducted a two-way analysis of variance, in 
which writing score was the dependent variable and topic type the independent 
variable. The results of the ANOVA, shown in Figure 4, reveal that our 
predictions were exactly the reverse of what actually happened: on average, 
expository/private topics are associated with the lowest writing scores and 
argumentative/public the highest. 



Figure 4: ANOVA
Writing Performance for Prompt Categories

ANALYSIS OF VARIANCE OF 8.CATSCOR     N = 8497 OUT OF 8497

SOURCE      DF      SUM OF SQRS   MEAN SQR   F-STATISTIC   SIGNIF

BETWEEN        4       896.71      224.18      22.899       .0000
WITHIN      8492     83137.         9.7900
TOTAL       8496     84034.
                                    (RANDOM EFFECTS STATISTICS)

ETA = .1033   ETA-SQR = .0107   (VAR COMP = .13141   %VAR AMONG = 1.32)

CATEGORY      N       MEAN     VARIANCE   STD DEV

EXPPRI      2538     9.0634     8.9856     2.9976
EXPPUB      1210     9.3983    11.348      3.3687
ARGPRI      1543     9.3597     9.9127     3.1484
ARGPUB      2417     9.9040     9.6762     3.1107
COMBIN       789     9.5171    10.100      3.1781

GRAND       8497     9.4462     9.8910     3.1450

CONTRAST
OBSERVED    PREDICTED    F-STAT    SIGNIF

-.80192       -0.        28.781     .0000
-.87924       -0.        34.599     .0000
-.20941       -0.        1.9627     .1613
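
The F statistic and eta-squared in Figure 4 can be reproduced, to rounding, from the category summary statistics alone. The sketch below shows that calculation. The function name is ours, and the group values are read from Figure 4, where several digits are hard to make out in the source copy (notably the expository/private variance), so the inputs should be treated as a best reading rather than exact data.

```python
def anova_from_summary(groups):
    """One-factor ANOVA from (n, mean, variance) tuples, one tuple per category."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups sum of squares: spread of category means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups sum of squares: recovered from each category's sample variance.
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# (N, mean writing score, variance) per topic type, as read from Figure 4.
figure4_groups = [
    (2538, 9.0634, 8.9856),    # expository/private (variance digits partly reconstructed)
    (1210, 9.3983, 11.348),    # expository/public
    (1543, 9.3597, 9.9127),    # argumentative/private
    (2417, 9.9040, 9.6762),    # argumentative/public
    (789, 9.5171, 10.100),     # combination
]
f_stat, eta_squared = anova_from_summary(figure4_groups)
```

The small eta-squared is the point to notice: although the difference between topic types is statistically significant with nearly 8,500 cases, topic type accounts for only about one percent of the variance in writing scores.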



We then looked at the combined effects of topic difficulty and prompt 
categories, predicting that topics with the lowest difficulty ratings and of the 
easiest (expository/private) type would get the highest writing scores, and that 
topics with the highest difficulty ratings and of the hardest 
(argumentative/public) type would get the lowest writing scores. To test this, we
again used a two-way analysis of variance, this time selecting writing score as the
dependent variable and topic difficulty and topic type as the independent
variables. It should be noted that in order to be able to use ANOVA for this
analysis, we had to collapse the number of difficulty levels from 9 to 2, in order
to eliminate a number of empty cells in the ANOVA table (i.e. some topic types
had only been assigned a limited range of difficulty ratings). The results of this
analysis are shown in Figure 5.



Figure 5: ANOVA
Topic Difficulty Judgments, Prompt Categories, and Writing
Performance

diffic   type      COUNT    CELL MEANS    ST DEV

1        expri      1647     8.99454      3.01525
1        expub       215     8.27442      3.26895
1        argpri      290     9.60690      3.11886
1        argpub      431     9.97680      3.08627
1        combin      399     9.62406      3.28066
2        expri       891     9.19080      2.96185
2        expub       995     9.64121      3.34214
2        argpri     1253     9.30247      3.15372
2        argpub     1986     9.88822      3.11648
2        combin      390     9.40769      3.06995

SOURCE    SUM OF SQUARES    DF     MEAN SQUARE      F           TAIL PROB

MEAN      451627.86928         1   451627.86938     46319.54    0.0
diffic        46.57869         1       46.57869         4.78    0.0269
type         769.24715         4      192.31179        19.72    0.0
dt           357.94852         4       89.48713         9.18    0.0000
ERROR      82750.52196      8487        9.75027

(Only the F value for MEAN and the ERROR degrees of freedom are legible in the
source copy; the remaining DF and F entries are recomputed from the sums of
squares and mean squares shown.)



As the ANOVA suggests and Table 4 shows clearly, our predictions were 
again almost the reverse of what actually happened: expository/private topics 
judged easiest (expri 1), as a group had the second lowest mean writing score, 
while argumentative/public topics judged most difficult, as a group had the 
second highest mean writing score. 
Table 4:

Combined Effects of Topic Difficulty and Topic Type

Mean
writing score     topic type & difficulty

8.27442           expository/public          1
8.99454           expository/private         1
9.19080           expository/private         2
9.30247           argumentative/private      2
9.40769           combination                2
9.60690           argumentative/private      1
9.62406           combination                1
9.64121           expository/public          2
9.88822           argumentative/public       2
9.97680           argumentative/public       1



IV. DISCUSSION 

Thus, patterns of relationship between topic difficulty, type and writing 
performance which we predicted based on commonly held assumptions were not 
matched by our writing score data. What we did find were unexpected but 
interesting patterns which should serve both to inform the item writing stage of 
direct writing test development, and to define questions about the effects of topic 
type and difficulty on writing performance which can be explored in future 
studies. 

Several intriguing questions for further study arise from possible 
explanations for the patterns we did discover in our data. One possible 
explanation is that our judges may have misperceived what is and is not difficult
for MELAB candidates to write about. A common perception about writing test
topics is that certain types of topics are more cognitively demanding than others,
and that writers will have more difficulty writing on these. Yet, it may be that
either what judges perceive as cognitively demanding to ESL writers is in fact
not, or alternately, that it is not necessarily harder for ESL writers to write about
the topics judged as more cognitively demanding. While some L1 studies have
concluded that personal or private topics are easier for L1 writers than
impersonal or public ones, and that argumentative topics are more difficult to
write on than topics calling for other discourse modes, these L1 findings do not
necessarily generalize to ESL writers.

Another possible explanation for the patterns we discovered is that perhaps 
more competent writers choose hard topics and less competent writers choose
easy topics. In fact, there is some indication in our data that this may be true. We
conducted a preliminary investigation of this question, using information 
provided by Part 3 scores of candidates in our dataset. The Part 3 component is
a 75-minute multiple choice grammar/cloze/vocabulary/reading test, for which
reliability has been measured at .96 (KR21). The Pearson correlation between
Part 3 and writing component scores is .73, which is generally interpreted to
mean that both components are measuring, to some extent, general language
proficiency. We assumed, for our investigation of the above question, that
students with a high general language proficiency (as measured by Part 3) will
tend to have high writing proficiency. In our investigation we examined mean
[...] indeed been chosen by candidates with higher mean Part 3 scores. We found this
to be true for 15 out of 32 (nearly half) of the topic sets; thus, half of the time,
general language proficiency and topic choice could account for the definite
patterns of relationship we observed between judged topic difficulty, topic type
and writing performance. One of these 15 sets, set 27, was used in a study by 
Spaan (1989), in which the same writers wrote on both topics in the set (A and 
B). While she found that, overall, there was not a significant difference between 
scores on the 2 topics, significant differences did occur for 7 subjects in her 
study. She attributed these differences mostly to some subjects apparently 
possessing a great deal more subject matter knowledge about one topic than the 
other. 

A further possible explanation for the relationship we observed between 
difficulty judgments and writing scores could be that harder topics, while perhaps 
more difficult to write on, push students toward better, rather than worse writing 
performance. This question was also explored through an investigation of topic 
difficulty judgments, mean Part 3 scores and mean writing scores for single 
topics in our dataset. We found in our dataset 3 topics whose mean Part 3
scores were below average, but whose mean writing scores were average, and
which were judged as "hard" (11 or 12, argumentative/public). One of these
topics asked writers to argue for or against US import restrictions on Japanese 
cars; another asked writers to argue for or against governments treating illegal 
aliens differently based on their different reasons for entering; the other asked 
writers to argue for or against socialized medicine. The disparity between Part 3 
and writing performance on these topics, coupled with the fact that they were 
judged as difficult, suggests that perhaps topic difficulty was an intervening 
variable positively influencing the writing performance of candidates who wrote 
on these particular topics. To thoroughly test this possibility, future studies could 
be conducted in which all candidates write on both topics in these sets. 

A related possibility is that perhaps topic difficulty has an influence, not 
necessarily on actual quality of writing performance, but on raters' evaluation of 
that performance. That is, perhaps MELAB composition raters, consciously or 
subconsciously, adjust their scores to compensate for, or even reward, choice of
a difficult topic. In discussions between raters involved in direct writing
assessment, it is not uncommon for raters to express concern that certain topics 
are harder to write on than others, and that writers should therefore be given 
"extra credit" for having attempted a difficult topic. Whether or not these 
concerns translate into actual scoring adjustments is an important issue for direct 
writing assessment research. 



V. CONCLUSION

In sum, the findings of this study provide us with information about topic
difficulty judgments and writing performance without which we could not effectively
proceed to design and carry out research aimed at answering the above
questions. In other words, we must first test our assumptions about topic
difficulty, allowing us to form valid constructs about topic difficulty effect; only
then can we proceed to carry out meaningful investigation of the effect of topic
type and difficulty on writing performance.
APPENDIX 1

MICHIGAN ENGLISH LANGUAGE ASSESSMENT BATTERY
COMPOSITION GLOBAL PROFICIENCY DESCRIPTIONS
(See reverse for composition codes)

97

Topic is richly and fully developed. Flexible use of a wide range of syntactic
(sentence level) structures, and accurate morphological (word form) control.
There is a wide range of appropriately used vocabulary. Organization is
appropriate and effective, and there is excellent control of connection.
Spelling and punctuation appear error free.

93

Topic is fully and complexly developed. Flexible use of a wide range of syntactic
structures. Morphological control is nearly always accurate. Vocabulary is broad
and appropriately used. Organization is well controlled and appropriate to the
material, and the writing is well connected. Spelling and punctuation errors are
not distracting.

87

Topic is well developed, with acknowledgement of its complexity. Varied syntactic
structures are used with some flexibility, and there is good morphological
control. Vocabulary is broad and usually used appropriately. Organization is
controlled and generally appropriate to the material, and there are few problems
with connection. Spelling and punctuation errors are not distracting.

83

Topic is generally clearly and completely developed, with at least some
acknowledgement of its complexity. Both simple and complex syntactic structures
are generally adequately used; there is adequate morphological control.
Vocabulary use shows some flexibility, and is usually appropriate. Organization
is controlled and shows some appropriacy to the material, and connection is
usually adequate. Spelling and punctuation errors are sometimes distracting.

77

Topic is developed clearly but not completely and without acknowledging its
complexity. Both simple and complex syntactic structures are present; in some
"77" essays these are cautiously and accurately used while in others there is
more fluency and less accuracy. Morphological control is inconsistent.
Vocabulary is adequate, but may sometimes be inappropriately used. Organization
is generally controlled, while connection is sometimes absent or unsuccessful.
Spelling and punctuation errors are sometimes distracting.

73

Topic development is present, although limited by incompleteness, lack of
clarity, or lack of focus. The topic may be treated as though it has only one
dimension, or only one point of view is possible. In some "73" essays both simple
and complex syntactic structures are present, but with many errors; others have
accurate syntax but are very restricted in the range of language attempted.
Morphological control is inconsistent. Vocabulary is sometimes inadequate, and
sometimes inappropriately used. Organization is partially controlled, while
connection is often absent or unsuccessful. Spelling and punctuation errors are
sometimes distracting.

67

Topic development is present but restricted, and often incomplete or unclear.
Simple syntactic structures dominate, with many errors; complex syntactic
structures, if present, are not controlled. Lacks morphological control. Narrow
and simple vocabulary usually approximates meaning but is often inappropriately
used. Organization, when apparent, is poorly controlled, and little or no
connection is apparent. Spelling and punctuation errors are often distracting.

63

Contains little sign of topic development. Simple syntactic structures are
present, but with many errors; lacks morphological control. Narrow and simple
vocabulary inhibits communication. There is little or no organization, and no
connection apparent. Spelling and punctuation errors often cause serious
interference.

57

Often extremely short; contains only fragmentary communication about the topic.
There is little syntactic or morphological control. Vocabulary is highly
restricted and inaccurately used. No organization or connection are apparent.
Spelling is often indecipherable and punctuation is missing or appears random.

53

Extremely short, usually about 40 words or less. Communicates nothing, and is
often copied directly from the prompt. There is little sign of syntactic or
morphological control. Vocabulary is extremely restricted and repetitively used.
There is no apparent organization or connection. Spelling is often
indecipherable and punctuation is missing or appears random.

N.O.T.

N.O.T. (Not On Topic) indicates a composition written on a topic completely
different from any of those assigned; it does not indicate that a writer has
merely digressed from or misinterpreted a topic. N.O.T. compositions often
appear prepared and memorized. They are not assigned scores or codes.

1/10/90
APPENDIX 1 (CONT'D)

MICHIGAN ENGLISH LANGUAGE ASSESSMENT BATTERY
COMPOSITION CODES
(See reverse for composition global proficiency descriptions)

NOTE: The codes are meant to indicate that a certain feature is ESPECIALLY GOOD
or BAD in comparison to the overall level of the writing.

CODE   INTERPRETATION

a      topic especially poorly or incompletely developed
b      topic especially well developed
c      organization especially inappropriate to material
d      organization especially uncontrolled
e      organization especially well controlled
f      connection especially poor
g      connection especially smooth
h      syntactic (sentence level) structures especially simple
i      syntactic structures especially complex
j      syntactic structures especially uncontrolled
k      syntactic structures especially controlled
l      especially poor morphological (word form) control
m      especially good morphological control
n      vocabulary especially narrow
o      vocabulary especially broad
p      vocabulary use especially inappropriate
q      vocabulary use especially appropriate
r      spelling especially inaccurate
s      punctuation especially inaccurate
t      paragraph divisions missing or apparently random
u      handwriting illegible or nearly illegible
v      question misinterpreted or not addressed
w      [illegible in the source copy]
x      other ([remainder illegible in the source copy])





APPENDIX 2: Samples of Topic Categories 



Type 1: EXPOSITORY/PRIVATE

When you go to a party, do you usually talk a lot, or prefer to listen? What does 
this show about your personality? 



Type 2: EXPOSITORY/PUBLIC 

Imagine that you are in charge of establishing the first colony on the moon.
What kind of people would you choose to take with you? What qualities and 
skills would they have? 



Type 3: ARGUMENTATIVE/PRIVATE 

A good friend of yours asks for advice about whether to work and make money 
or whether to continue school. What advice would you give him/her?

Type 4: ARGUMENTATIVE/PUBLIC 

What is your opinion of mercenary soldiers (those who are hired to fight for a
country other than their own)? Discuss.



Type 5: COMBINATION (ARGUMENTATIVE/EXPOSITORY/PUBLIC)

People who have been seriously injured can be kept alive by machines. Do you 
think they should be kept alive at great expense, or allowed to die? Explain your 
reasons. 




